"VISUS-Krüger-MC1"

VAST 2012 Challenge
Mini-Challenge 1: Bank of Money Enterprise: Cyber Situation Awareness

 

 

Team Members:

 

Robert Krüger, Institute for Visualization and Interactive Systems, University of Stuttgart, kruegert@vis.uni-stuttgart.de     PRIMARY
Harald Bosch, Institute for Visualization and Interactive Systems, University of Stuttgart, boschhd@vis.uni-stuttgart.de
Steffen Koch, Institute for Visualization and Interactive Systems, University of Stuttgart, kochsn@vis.uni-stuttgart.de
Christoph Müller, Visualization Research Center, University of Stuttgart, mueller@visus.uni-stuttgart.de
Guido Reina, Visualization Research Center, University of Stuttgart, reina@visus.uni-stuttgart.de
Dennis Thom, Institute for Visualization and Interactive Systems, University of Stuttgart, thomds@vis.uni-stuttgart.de
Thomas Ertl, Institute for Visualization and Interactive Systems, University of Stuttgart, ertl@vis.uni-stuttgart.de

Student Team:  NO

 

Tool(s):

Custom tool, developed by the Institute for Visualization and Interactive Systems and the Visualization Research Center of the University of Stuttgart.

 

Video:

 

VISUS-Krueger-MC1.wmv

 

 

Answers to Mini-Challenge 1 Questions:

 

MC 1.1  Create a visualization of the health and policy status of the entire Bank of Money enterprise as of 2 pm BMT (BankWorld Mean Time) on February 2. What areas of concern do you observe? 

In our map view, which shows the locations and mean state values of each facility, we observe that the majority of machines in most regions are in the “healthy” state and only some suffer from a moderate policy deviation. However, it is also obvious that in region-5 and region-10 there is not a single machine in a healthy state. Using temporal and spatial filtering on the map and histogram views, one can see that at 14:00 BMT not a single one of the 1,571 machines in 16 facilities of the Atta region (region-25) sends state reports, although this is within business hours:

Figure mc1.1_1.png: map view with inactive facilities shown as crosses; the respective machines appear in state '0' in the histogram
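
The same facility-level check can be reproduced outside our tool directly on the raw status logs. The following minimal pandas sketch illustrates the idea; the file name and the column names (healthtime, region, facility, ipaddr) are assumptions about the log schema and have to be adapted to the actual data.

    import pandas as pd

    # Minimal sketch of the spatial/temporal filter described above.
    # File and column names are assumed and must be adapted to the actual schema.
    logs = pd.read_csv("bom_health_status.csv", parse_dates=["healthtime"])

    atta = logs[logs["region"] == "region-25"]                    # spatial filter
    snapshot = atta[(atta["healthtime"] >= "2012-02-02 13:45") &  # 15-minute reporting
                    (atta["healthtime"] <  "2012-02-02 14:00")]   # window before 14:00 BMT

    machines_total = atta.groupby("facility")["ipaddr"].nunique()        # machines ever seen
    machines_reporting = snapshot.groupby("facility")["ipaddr"].nunique()

    silent = machines_total[~machines_total.index.isin(machines_reporting.index)]
    print(len(silent), "silent facilities,", silent.sum(), "machines affected")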

These facilities are possibly completely offline. Datacenter-5 in region-10 also shows a disproportionately high number of machines not logging their states: only 2,090 of 49,000 machines are reporting data here.

BoM employees do not seem to adhere to the BoM policy of turning off their machines overnight whenever possible. After filtering for workstation machines and using the time slider, the map gives an overview of this behavior by showing many state reports during night time; we analyze the behavior further in the matrix view:

Figure mc1.1_8.png: map view displaying many machines logging at night; the matrix view (right) shows shut-down machines in black
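
The fraction of workstations that keep reporting at night can be estimated in the same spirit. The sketch below simplifies “night” to fixed BMT hours instead of each machine's local time and again assumes hypothetical column names (machineclass, healthtime, ipaddr), so it is an approximation only.

    import pandas as pd

    # Rough estimate of how many workstations report during night hours.
    # Night is simplified to 00:00-05:00 BMT; column names are assumed.
    logs = pd.read_csv("bom_health_status.csv", parse_dates=["healthtime"])

    workstations = logs[logs["machineclass"] == "workstation"]
    night = workstations[workstations["healthtime"].dt.hour < 5]

    share = night["ipaddr"].nunique() / workstations["ipaddr"].nunique()
    print(f"{share:.0%} of all workstations report at least once at night")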

Using the swim lanes overview, one can observe that only a relatively small number of machines exhibit critical policy deviations, while a single outlier, a compute server (172.2.194.20) in datacenter-2, already reports the “infected” state:

Figure mc1.1_7.png: first sighting of the infection in the matrix view and swim lanes

The matrix view in the image shows the machine's state history and demonstrates that the machine first reported the “infected” state at 03:30 local time, without any suspicious activity having been reported before. Its policy state, however, had already risen quickly from 2 to 4, starting at 23:45 local time.
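
Finding this outlier does not require the swim lanes; it can also be done with a simple query over the status records. The sketch below assumes a policy column in which the value 5 encodes the “infected” state; both the column name and the encoding are assumptions.

    import pandas as pd

    # Locate the first machine that reports the "infected" policy state and
    # print its full state history. Column names and the encoding policy == 5
    # for "infected" are assumptions about the log schema.
    logs = pd.read_csv("bom_health_status.csv", parse_dates=["healthtime"])

    infected = logs[logs["policy"] == 5]
    first = infected.sort_values("healthtime").iloc[0]
    print("first infected report:", first["ipaddr"], "at", first["healthtime"])

    history = logs[logs["ipaddr"] == first["ipaddr"]].sort_values("healthtime")
    print(history[["healthtime", "policy", "activity"]].to_string(index=False))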

 

MC 1.2  Use your visualization tools to look at how the network’s status changes over time. Highlight up to five potential anomalies in the network and provide a visualization of each. When did each anomaly begin and end? What might be an explanation of each anomaly?

Starting at 12:15 BMT (09:15 local time) on the first day, we see whole facilities in Atta going completely offline. Examining this behavior with the time slider, we see the first two facilities “disappear” in the south of Atta; subsequently, a front of “disconnected/offline” facilities moves north. Only 16 facilities in the very northwest of the region are still online at 18:15 BMT:

Figure mc1.2_attaAnimation.png: cross markers indicating offline facilities

The first facilities start coming back at 23:30 local time, from southwest to northeast. In the micro-analysis using the matrix view, we can see that some of the machines in those facilities do not come back before 04:30 BMT. We assume that this anomaly is caused by extraneous factors such as a power outage or a network failure rather than by planned maintenance, mainly because entire facilities drop out at once and the outage sweeps across the region like a front during business hours.

This supports the hypothesis that we have detected an external anomaly. Although protecting infrastructure against such events is expensive, BoM should ensure that at least the critical backbone infrastructure is protected by a sufficiently dimensioned UPS and that redundant network paths exist, as losing the connection to a whole business unit is hardly acceptable.
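
The begin and end times of the outage can also be derived per facility from the reporting gaps in the raw logs. The following sketch flags every gap of more than two reporting cycles as an outage; as before, the column names (region, facility, healthtime) are assumptions.

    import pandas as pd

    # Estimate per-facility outage intervals in the Atta region: a facility is
    # considered offline while none of its machines send reports. Reports
    # normally arrive every 15 minutes, so gaps above 30 minutes are flagged.
    logs = pd.read_csv("bom_health_status.csv", parse_dates=["healthtime"])
    atta = logs[logs["region"] == "region-25"]

    for facility, reports in atta.groupby("facility"):
        times = reports["healthtime"].drop_duplicates().sort_values()
        gaps = times.diff()
        for idx, gap in gaps[gaps > pd.Timedelta("30min")].items():
            back = times.loc[idx]
            print(f"{facility}: offline from {back - gap} until {back}")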

All staff of Bank of Money are encouraged to turn off their workstations at night; however, this rule is followed for only ~60% of all workstations. In order to discover relevant transitions between single states (e.g., policy) and combined states (e.g., policy/connection) for a large number of machines, we employ an aggregated state graph visualization. To examine the employees' workstation behavior, we select all loan, teller, and office machines and observe their combined policy/connection transitions.

Figure mc1.1_20.png: state transitions in office machines (left) and teller machines (right); the numbers indicate connection quantile/policy combinations
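
The edge weights of such an aggregated state graph are essentially transition counts between consecutive reports of the same machine. A minimal sketch of this aggregation is given below; it combines the policy state with the raw activity flag instead of the connection quantiles used in the tool, and the column names are again assumptions.

    import pandas as pd

    # Count transitions between combined (policy, activity) states for all
    # loan, teller and office machines; the result corresponds to the edge
    # weights of the aggregated state graph. Column names are assumed.
    logs = pd.read_csv("bom_health_status.csv", parse_dates=["healthtime"])
    sel = logs[logs["machinefunction"].isin(["loan", "teller", "office"])].copy()

    sel = sel.sort_values(["ipaddr", "healthtime"])
    sel["state"] = list(zip(sel["policy"], sel["activity"]))
    sel["next_state"] = sel.groupby("ipaddr")["state"].shift(-1)

    transitions = (sel.dropna(subset=["next_state"])
                      .groupby(["state", "next_state"]).size()
                      .sort_values(ascending=False))
    print(transitions.head(10))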

At this point we see an anomaly in connection numbers that is exhibited only by teller machines in certain policy states. To investigate this phenomenon further, we select these machines and analyze their behavior over time in a parallel coordinates visualization. It shows 29 teller machines with a disproportionately high number of up to 100 connections during the first night between 02:15 and 05:00 local time. By highlighting these machines in that view, we can observe that they are all associated with the same business unit, and by looking at the map we see them distributed over region-10.

Figure mc1.1_15.png: night-time activity and high connection counts on teller machines in region-10
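
Selecting these conspicuous teller machines can be mimicked by a simple threshold filter on the connection counts during the night window. The sketch below uses an arbitrary threshold of 50 connections and an illustrative date; the column names (machinefunction, numconnections) are assumptions.

    import pandas as pd

    # Teller machines with unusually many connections during the first night.
    # Threshold, date and column names are illustrative assumptions.
    logs = pd.read_csv("bom_health_status.csv", parse_dates=["healthtime"])
    tellers = logs[logs["machinefunction"] == "teller"]

    night = tellers[(tellers["healthtime"] >= "2012-02-02 02:15") &
                    (tellers["healthtime"] <= "2012-02-02 05:00")]
    suspicious = night[night["numconnections"] > 50]

    print(suspicious["ipaddr"].nunique(), "teller machines with high connection counts")
    print(suspicious.groupby("ipaddr")["numconnections"].max().sort_values(ascending=False))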

In the following night, during the very same time frame, already 893 of 2,548 machines, including the 29 from the first night, show the same behavior. Inspecting the logged activities does not reveal any specific event that could explain this. We suggest letting region-10 staff investigate whether the increase in connections was caused by BoM infrastructure or not. If it was not, the machines' suspicious, parallel behavior might be caused by a botnet; even worse, this behavior was not detected as a policy deviation, which may call for updating the virus scanners and/or the policy definition.
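
Whether the 29 machines from the first night are contained in the larger group of the second night can be verified with a simple set comparison, using the same assumed columns and threshold as in the previous sketch (the exact dates are again illustrative).

    import pandas as pd

    # Compare the sets of high-connection teller machines of both nights.
    logs = pd.read_csv("bom_health_status.csv", parse_dates=["healthtime"])
    tellers = logs[logs["machinefunction"] == "teller"]

    def high_connection_machines(day: str) -> set:
        night = tellers[(tellers["healthtime"] >= f"{day} 02:15") &
                        (tellers["healthtime"] <= f"{day} 05:00")]
        return set(night.loc[night["numconnections"] > 50, "ipaddr"])

    night1 = high_connection_machines("2012-02-02")
    night2 = high_connection_machines("2012-02-03")
    print(len(night1), "machines in night 1,", len(night2), "in night 2")
    print("night-1 machines also present in night 2:", night1 <= night2)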

Our line chart view indicates that a serious degradation of the health state grows exponentially across the BoM network, which we interpret as a spreading virus. Moreover, our state transition graph shows that not a single machine improves its policy state over time and that the degradation happens gradually, without skipping states.

Figure mc1.1_25.png: state transition graph (left) and line chart showing the health deterioration (right)
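
Both observations can be checked numerically: the growth of the infected population over time, and the absence of improving or state-skipping policy transitions. The sketch below again assumes a policy column with the value 5 for “infected”.

    import pandas as pd

    # (1) cumulative number of machines that have reported "infected" over time,
    # (2) number of policy transitions that improve or skip a state.
    logs = pd.read_csv("bom_health_status.csv", parse_dates=["healthtime"])
    logs = logs.sort_values(["ipaddr", "healthtime"])

    first_infection = logs[logs["policy"] == 5].groupby("ipaddr")["healthtime"].min()
    growth = first_infection.dt.floor("6h").value_counts().sort_index().cumsum()
    print(growth)   # roughly exponential growth shows up as accelerating counts

    step = logs.groupby("ipaddr")["policy"].diff().dropna()
    print("improving transitions:", (step < 0).sum(), "| skipped states:", (step > 1).sum())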

Our observations show that the current maintenance plan does not seem to be effective: although we can see regular maintenance activities in the machines' state histories, these activities appear to neither slow down nor stop the degradation of policy states. Our set-based selection management tool reveals that, of the 6,379 machines reporting the “infected” status, 1,075 have logged a maintenance activity, and these maintenance activities are evenly distributed over all policy states.

Figure mc1.1_27.png: set-based selection management that helps (in)validating the analyst's hypothesis; note that none of the machines that receive maintenance in policy state five ever improve their policy afterwards (top row)
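
The set-based comparison itself boils down to an intersection of machine sets. The sketch below assumes an activity column in which a particular code (here 2) marks a maintenance event; the actual encoding has to be looked up in the data dictionary.

    import pandas as pd

    # How many machines that eventually report "infected" have ever logged a
    # maintenance activity, and in which policy state the maintenance happened.
    # The maintenance activity code (2) and the column names are assumptions.
    logs = pd.read_csv("bom_health_status.csv", parse_dates=["healthtime"])

    infected = set(logs.loc[logs["policy"] == 5, "ipaddr"])
    maintenance = logs[logs["activity"] == 2]
    maintained = set(maintenance["ipaddr"])

    print(len(infected), "infected machines,",
          len(infected & maintained), "of them with logged maintenance")
    print(maintenance[maintenance["ipaddr"].isin(infected)]["policy"].value_counts())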

This distribution is the same for the ~336,000 machines that stay healthy, i.e., problematic machines are not handled with higher priority. Only a few machines reach policy state 5 without reporting critical policy deviations before, and these are all offline immediately before being infected. These observations, however, show that the maintenance strategy at BoM is in dire need of improvement: currently, deteriorated machines cannot be recovered at all.

We could not detect a direct relation between attached USB devices and subsequent virus infection. We assume the early infection of datacenters might boost the spreading of the infection.

The map shows that region-5 and region-10 already have higher-than-average policy states from the beginning. Investigating the histograms of these regions shows that not a single machine has reported a healthy status; instead, they all start with moderate deviations. Apart from this difference, the machines do not seem to develop worse than the rest of the BoM network. We still think it is indispensable that BoM reacts to such large-scale deviations in a timely fashion, but we see no indication in the sampled data that measures were taken.

At the beginning of the time series, only three machines in datacenter-5 in region-10 report their state, all of them office machines; the remaining 51,327 machines are offline. At 04:45 local time, 240 servers come online, but only for one hour. The policy line plot shows that large groups of machines start operating in the following hours. About 15 percent of them start with a moderate policy deviation, 1% with serious or critical deviations, and one machine already reports a possible virus.

Figure mc1.1_29.png: line plot and matrix view of datacenter-5 going operational
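
The ramp-up of datacenter-5 can be reconstructed from the time of each machine's first status report, bucketed into reporting intervals. As before, the facility value and the column names in the sketch are assumptions.

    import pandas as pd

    # Number of machines in datacenter-5 appearing per 15-minute interval,
    # based on each machine's first status report. Column names are assumed.
    logs = pd.read_csv("bom_health_status.csv", parse_dates=["healthtime"])
    dc5 = logs[logs["facility"] == "datacenter-5"]

    first_report = dc5.groupby("ipaddr")["healthtime"].min()
    ramp_up = first_report.dt.floor("15min").value_counts().sort_index()
    print(ramp_up.head(20))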

We hypothesize that datacenter-5 is being put into operation for the first time, which is why a smaller number of servers is tested before the vast majority is powered up.